Course Description
Character strings can turn up in all stages of a data science project. You might have to clean messy string input before analysis, extract data that is embedded in text or automatically turn numeric results into a sentence to include in a report. Perhaps the strings themselves are the data of interest, and you need to detect and match patterns within them. This course will help you master these tasks by teaching you how to pull strings apart, put them back together and use stringr to detect, extract, match and split strings using regular expressions, a powerful way to express patterns.
You’ll start with some basics: how to enter strings in R, how to control how numbers are transformed to strings, and finally how to combine strings together to produce output that combines text and nicely formatted numbers.
# Define line1
line1 <- "The table was a large one, but the three were all crowded
together at one corner of it:"
# Define line2
line2 <- '"No room! No room!" they cried out when they saw Alice
coming.'
# Define line3
line3 <- '"There\'s plenty of room!" said Alice indignantly,
and she sat down in a large arm-chair at one end of the table.'
You can escape quotes inside strings using a backslash.
Take a look at line2
## [1] "\"No room! No room!\" they cried out when they saw Alice \ncoming."
Even though you used single quotes so you didn’t have to escape any double quotes, when R prints it, you’ll see escaped double quotes (\")! R doesn’t care how you defined the string; it only knows what the string represents: in this case, a string with double quotes inside.
When you ask R for line2 it is actually calling print(line2), and the print() method for strings displays strings as you might enter them. If you want to see the string it represents you’ll need to use a different function: writeLines().
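A minimal sketch of the difference between the two views, using a shortened version of line2 (the variable name line2_short is an assumption made for brevity):

```r
# A string defined with single quotes, so the inner double quotes need no escaping
line2_short <- '"No room! No room!" they cried'

# print() shows the string as you would type it, with escaped double quotes
print(line2_short)

# writeLines() shows the content the string actually represents
writeLines(line2_short)
```

The stored string is the same in both cases; only the display differs.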
## [1] "The table was a large one, but the three were all crowded \ntogether at one corner of it:"
## [2] "\"No room! No room!\" they cried out when they saw Alice \ncoming."
## [3] "\"There's plenty of room!\" said Alice indignantly, \nand she sat down in a large arm-chair at one end of the table."
## The table was a large one, but the three were all crowded
## together at one corner of it:
## "No room! No room!" they cried out when they saw Alice
## coming.
## "There's plenty of room!" said Alice indignantly,
## and she sat down in a large arm-chair at one end of the table.
## The table was a large one, but the three were all crowded
## together at one corner of it: "No room! No room!" they cried out when they saw Alice
## coming. "There's plenty of room!" said Alice indignantly,
## and she sat down in a large arm-chair at one end of the table.
## hello
## 🌍
The function cat() is very similar to writeLines(), but by default separates elements with a space, and will attempt to convert non-character objects to a string. We won’t use it in this course, but you might see it in other people’s code.
You might have been surprised at the output from the last part of the last exercise. How did you get two lines from one string, and how did you get that little globe? The key is the \.
A sequence in a string that starts with a \ is called an escape sequence and allows us to include special characters in our strings. You saw one escape sequence in the first exercise: \" is used to denote a double quote.
In "hello\n\U1F30D" there are two escape sequences: \n gives a newline, and \U followed by up to 8 hex digits denotes a particular Unicode character.
Unicode is a standard for representing characters that might not be on your keyboard. Each available character has a Unicode code point: a number that uniquely identifies it. These code points are generally written in hex notation, that is, using base 16 and the digits 0-9 and A-F. You can find the code point for a particular character by looking up a code chart. If you only need four digits for the code point, an alternative escape sequence is \u.
When R comes across a \ it assumes you are starting an escape, so if you actually need a backslash in your string you’ll need the sequence \\.
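As a small illustration of these escape sequences (the variable name backslash is only for this sketch):

```r
# "\\" in source code is a single backslash in the string itself
backslash <- "To have a \\ you need \\\\"
writeLines(backslash)

# "\n" starts a new line; "\U1F30D" is the code point of the globe emoji
writeLines("hello\n\U1F30D")

# An escape sequence counts as a single character
nchar("\\")
nchar("a\nb")
```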
## To have a \ you need \\
## This is a really
## really
## really
## long string
## नमस्ते दुनिया
You can read about a few other escape sequences in the help page ?Quotes.
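The scientific argument to format() controls whether the numbers are displayed in fixed (scientific = FALSE) or scientific (scientific = TRUE) format.
When the representation is fixed, the digits argument sets the minimum number of significant digits for the smallest (in magnitude) number, and every number is then formatted to the same number of decimal places. For example, if the smallest number is 0.0011, and digits = 1, then 0.0011 requires 3 places after the decimal to represent it to 1 significant digit, 0.001. Every other number will be formatted to 3 places after the decimal point.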
So, how many decimal places will you get if 1.0011 is the smallest number? You’ll find out in this exercise.
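The rule can be checked with a short sketch. The vectors x and y are assumptions chosen to be consistent with the output shown in this section:

```r
# The smallest number needs 3 decimal places to show 1 significant digit,
# so every element is formatted with 3 decimal places
x <- c(0.0011, 0.011, 1)
format(x, digits = 1)

# Here the smallest number is 1, which needs no decimal places at all
y <- c(1.0011, 2.011, 1)
format(y, digits = 1)

# scientific = TRUE switches to scientific notation
format(x, scientific = TRUE)

# big.mark inserts a thousands separator; trim = TRUE drops the padding
format(c(72, 1030, 10292, 1189192), big.mark = ",", trim = TRUE)
```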
## [1] "0.001" "0.011" "1.000"
## [1] "1" "2" "1"
## [1] " 4.0" "-1.9" " 3.0" "-5.0"
## [1] " 72" " 1030" " 10292" "1189192"
## [1] "0.12000000000" "0.98000000000" "0.00001910000" "0.00000000002"
## [1] " 72" " 1030" " 10292" "1189192"
## 72
## 1030
## 10292
## 1189192
## 72
## 1030
## 10292
## 1189192
## 72
## 1,030
## 10,292
## 1,189,192
The function formatC() provides an alternative way to format numbers based on C style syntax.
Rather than a scientific argument, formatC() has a format argument that takes a code representing the required format. The most useful are:
"f" for fixed,
"e" for scientific, and
"g" for fixed unless scientific saves space.
With format = "g", the digits argument behaves like it does in format(); it specifies the number of significant digits. However, unlike format(), when using fixed format, digits is the number of digits after the decimal point (with "e" it is likewise the number of digits after the decimal point of the mantissa). This is more predictable than format(), because the number of places after the decimal is fixed regardless of the values being formatted.
formatC() also formats numbers individually, which means you always get the same output regardless of other numbers in the vector.
The flag argument allows you to provide some modifiers that, for example, force the display of the sign (flag = "+"), left align numbers (flag = "-") and pad numbers with leading zeros (flag = "0").
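A brief sketch of these arguments, with input vectors assumed to match the output shown in this section:

```r
x <- c(0.0011, 0.011, 1)
percent_change <- c(4, -1.9, 3, -5)
p_values <- c(0.12, 0.98, 0.0000191, 0.00000000002)

# Fixed format: digits is the number of places after the decimal point
formatC(x, format = "f", digits = 1)

# flag = "+" forces the sign to be displayed
formatC(percent_change, format = "f", digits = 1, flag = "+")

# "g" uses fixed format unless scientific is shorter; digits is significant digits
formatC(p_values, format = "g", digits = 2)
```

Because each number is formatted individually, the results are not padded to a common width.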
## [1] "0.0" "0.0" "1.0"
## [1] "1.0" "2.0" "1.0"
## [1] "4.0" "-1.9" "3.0" "-5.0"
## [1] "+4.0" "-1.9" "+3.0" "-5.0"
## [1] "0.12" "0.98" "1.9e-05" "2e-11"
## [1] "$72" "$1,030" "$10,292" "$1,189,192"
## [1] "+4.0%" "-1.9%" "+3.0%" "-5.0%"
## [1] "2010: +4.0%,2011: -1.9%,2012: +3.0%,2013: -5.0%"
Specifying sep = "" is so common, there is actually another function, paste0(), that works like paste() but always pastes elements together without a separator between them.
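A short sketch comparing the two, with example vectors assumed from the output shown earlier:

```r
years <- 2010:2013
pretty_percent <- c("+4.0%", "-1.9%", "+3.0%", "-5.0%")

# paste() separates its arguments with a space by default
paste(years, pretty_percent)

# paste0() is paste() with sep = ""
paste0(years, ": ", pretty_percent)

# collapse joins the resulting elements into a single string
year_summary <- paste0(years, ": ", pretty_percent, collapse = ", ")
year_summary
```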
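# Incomes to format (assumed here; the values match the output shown earlier)
income <- c(72, 1030, 10292, 1189192)
# Define the names vector
income_names <- c("Year 0", "Year 1", "Year 2", "Project Lifetime")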
# Create pretty_income
pretty_income <- format(income, digits = 2, big.mark = ",")
# Create dollar_income
dollar_income <- paste("$", pretty_income, sep = "")
# Create formatted_names
formatted_names <- format(income_names, justify = "right")
# Create rows
rows <- paste(formatted_names, dollar_income, sep = " ")
# Write rows
writeLines(rows)
##           Year 0 $       72
##           Year 1 $    1,030
##           Year 2 $   10,292
## Project Lifetime $1,189,192
If you wanted the dollar signs right next to the numbers, you could format the incomes with trim = TRUE, paste on the $, then format again as a string with justify = "right".
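That suggestion could look roughly like this. The income values are an assumption taken from the output above, and the intermediate names are only for illustration:

```r
income <- c(72, 1030, 10292, 1189192)

# Step 1: format without padding
trimmed <- format(income, digits = 2, big.mark = ",", trim = TRUE)

# Step 2: attach the dollar sign directly to each number
dollars <- paste0("$", trimmed)

# Step 3: pad the strings on the left so they line up on the right
aligned <- format(dollars, justify = "right")
writeLines(aligned)
```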
toppings <- c("anchovies", "artichoke", "bacon", "breakfast bacon", "Canadian bacon",
"cheese", "chicken", "chili peppers", "feta", "garlic", "green peppers",
"grilled onions", "ground beef", "ham", "hot sauce", "meatballs",
"mushrooms", "olives", "onions", "pepperoni", "pineapple", "sausage",
"spinach", "sun-dried tomato", "tomatoes")
# Randomly sample 3 toppings
my_toppings <- sample(toppings, size = 3)
# Print my_toppings
my_toppings
## [1] "meatballs" "garlic" "cheese"
# Paste "and " to last element: my_toppings_and
my_toppings_and <- paste(c("", "", "and "), my_toppings, sep = "")
# Collapse with comma space: these_toppings
these_toppings <- paste(my_toppings_and, collapse = ", ")
# Add rest of sentence: my_order
my_order <- paste("I want to order a pizza with ", these_toppings, ".", sep = "")
# Order pizza with writeLines()
writeLines(my_order)
## I want to order a pizza with meatballs, garlic, and cheese.
Time to meet stringr! You’ll start by learning about some stringr functions that are very similar to some base R functions, then how to detect specific patterns in strings, how to split strings apart and how to find and replace parts of strings.
str_c()
str_c() is similar to paste() and takes sep and collapse arguments.
There are two key ways str_c() differs from paste().
First, the default separator is sep = "", as opposed to a space, so it’s more like paste0().
Second, paste() turns missing values into the string “NA”, whereas str_c() propagates missing values. That means combining any strings with a missing value will result in another missing value.
This behavior is nice because you learn quickly when you might have missing values, rather than discovering weird “NA”s inside your strings later. Another stringr function that is useful when you may have missing values is str_replace_na(), which replaces missing values with any string you choose.
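A small sketch of the missing-value behaviour, using a made-up vector x:

```r
library(stringr)

x <- c("a", NA)

# paste() silently turns the missing value into the text "NA"
paste(x, "-suffix", sep = "")

# str_c() keeps the result missing
str_c(x, "-suffix")

# str_replace_na() substitutes a string of your choice before combining
str_c(str_replace_na(x, "missing"), "-suffix")
```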
str_length()
str_length() takes a vector of strings and returns the number of characters in each string.
## [1] 5 5
This is very similar to the base function nchar(), but str_length() handles factors in an intuitive way, whereas nchar() will just return an error.
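The difference shows up as soon as the input is a factor. In this sketch the vector of names is an assumption based on the Batman example used later:

```r
library(stringr)

hero <- c("Bruce", "Wayne")
str_length(hero)

# A factor holding the same strings
hero_factor <- factor(hero)

# str_length() works on the labels of the factor
str_length(hero_factor)

# nchar() refuses factors; try() keeps the error from stopping the script
nchar_result <- try(nchar(hero_factor), silent = TRUE)
class(nchar_result)
```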
library(stringr)
library(babynames)
library(dplyr)
# Extracting vectors for boys' and girls' names
babynames_2014 <- filter(babynames, year == 2014)
boy_names <- filter(babynames_2014, sex == "M")$name
girl_names <- filter(babynames_2014, sex == "F")$name
# Take a look at a few boy_names
head(boy_names)
## [1] "Noah" "Liam" "Mason" "Jacob" "William" "Ethan"
## [1] 4 4 5 5 7 5
## [1] 0.3360496
## [1] 4 4 5 5 7 5
The average length of the girls’ names in 2014 is about 1/3 of a character longer. Just be aware this is a naive average where each name is counted once, not weighted by how many babies received the name. A better comparison might be an average weighted by the n column in babynames.
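A weighted average could be sketched as follows. The data frame toy_names is hypothetical and only stands in for the babynames data; weighted.mean() is the base R function for this:

```r
library(stringr)

# Hypothetical counts: three babies named Noah, one named William
toy_names <- data.frame(name = c("Noah", "William"), n = c(3, 1))

name_lengths <- str_length(toy_names$name)

# Naive average: each name counts once
mean(name_lengths)

# Weighted average: each name counts once per baby
weighted.mean(name_lengths, toy_names$n)
```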
str_sub()
str_sub() extracts parts of strings based on their position. The first argument, string, is a vector of strings. The arguments start and end specify the boundaries of the piece to extract in characters.
For example, str_sub(x, 1, 4) asks for the substring starting at the first character, up to the fourth character, or in other words the first four characters. Try it with Batman’s name:
## [1] "Bruc" "Wayn"
Both start and end can be negative integers, in which case they count from the end of the string. For example, str_sub(x, -4, -1) asks for the substring starting at the fourth character from the end, up to the first character from the end, i.e. the last four characters. Again, try it with Batman:
## [1] "ruce" "ayne"
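Both calls can be reproduced with a short sketch, assuming x holds the two parts of Batman's name:

```r
library(stringr)

x <- c("Bruce", "Wayne")

# First four characters
str_sub(x, 1, 4)

# Last four characters, counting from the end
str_sub(x, -4, -1)

# The same idea gives the first and last letter of each string
str_sub(x, 1, 1)
str_sub(x, -1, -1)
```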
## boy_first_letter
## A B C D E F G H I J K L M N O
## 1450 651 767 996 549 185 332 401 234 1388 1290 536 913 424 207
## P Q R S T U V W X Y Z
## 230 56 778 804 771 43 160 174 56 252 379
## boy_last_letter
## a b c d e f g h i j k l m n o
## 421 104 92 436 1145 66 81 582 704 57 349 942 389 4664 729
## p q r s t u v w x y z
## 32 19 1011 825 291 81 71 34 86 696 119
## girl_first_letter
## A B C D E F G H I J K L M N O
## 3099 698 941 808 932 209 345 468 373 1429 1689 1121 1744 752 143
## P Q R S T U V W X Y Z
## 301 38 830 1366 681 28 214 85 62 294 500
## girl_last_letter
## a b c d e f g h i j k l m n o
## 6624 20 13 81 3111 8 21 1936 1580 12 31 450 115 2600 104
## p q r s t u v w x y z
## 3 2 291 326 208 59 6 17 49 1432 51
"A" is the most popular first letter for both boys and girls, and the most popular last letter for girls. The most popular last letter for boys is "n".
substr() is a base R function that is similar to str_sub(). The advantage of str_sub() is the ability to use negative indexes to count from the end of a string.
stringr functions that look for matches
All take a pattern argument:
str_detect()
str_subset()
str_count()
str_detect()
str_detect() returns TRUE for elements that contain the pattern and FALSE otherwise.
## [1] FALSE TRUE TRUE
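A minimal sketch of str_detect(). The pizzas vector is an assumption reconstructed from the outputs in this section:

```r
library(stringr)

pizzas <- c("cheese", "pepperoni", "sausage and green peppers")

# fixed() asks for a literal match rather than a regular expression
has_pepper <- str_detect(pizzas, pattern = fixed("pepper"))
has_pepper

# The logical vector can be used for subsetting
pizzas[has_pepper]

# Summing the logical vector counts the matching elements
sum(has_pepper)
```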
## logi [1:14026] FALSE FALSE FALSE FALSE FALSE FALSE ...
## [1] 16
## [1] "Uzziah" "Ozzie" "Ozzy" "Uzziel" "Jazz"
## [6] "Chazz" "Izzy" "Azzam" "Izzac" "Izzak"
## [11] "Fabrizzio" "Jazziel" "Azzan" "Izzaiah" "Muizz"
## [16] "Yazziel"
## # A tibble: 16 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2014 M Uzziah 67 0.0000328
## 2 2014 M Ozzie 62 0.0000304
## 3 2014 M Ozzy 57 0.0000279
## 4 2014 M Uzziel 21 0.0000103
## 5 2014 M Jazz 20 0.00000980
## 6 2014 M Chazz 17 0.00000833
## 7 2014 M Izzy 16 0.00000784
## 8 2014 M Azzam 14 0.00000686
## 9 2014 M Izzac 13 0.00000637
## 10 2014 M Izzak 8 0.00000392
## 11 2014 M Fabrizzio 7 0.00000343
## 12 2014 M Jazziel 6 0.00000294
## 13 2014 M Azzan 5 0.00000245
## 14 2014 M Izzaiah 5 0.00000245
## 15 2014 M Muizz 5 0.00000245
## 16 2014 M Yazziel 5 0.00000245
That last example is another common use of str_detect(): subsetting a data frame to rows where the values in a column contain the pattern of interest. In this case it lets us see these double-z names are pretty rare. For example, even the most popular, Uzziah, only accounted for 0.003% of boys born in 2014.
Since detecting strings with a pattern and then subsetting out those strings is such a common operation, stringr provides a function, str_subset(), that does that in one step.
For example, let’s repeat our search for “pepper” in our pizzas using str_subset():
## [1] "pepperoni" "sausage and green peppers"
We get a new vector of strings, but it only contains those original strings that contained the pattern.
str_subset() can be easily confused with str_extract(). str_extract() returns a vector of the same length as the input vector, but with only the parts of the strings that matched the pattern.
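The contrast is easiest to see side by side. As before, the pizzas vector is an assumption based on the outputs shown:

```r
library(stringr)

pizzas <- c("cheese", "pepperoni", "sausage and green peppers")

# str_subset() keeps whole strings that contain the pattern
subset_result <- str_subset(pizzas, pattern = "pepper")
subset_result

# str_extract() keeps the length of the input and returns only the matched part,
# with NA where there is no match
extract_result <- str_extract(pizzas, pattern = "pepper")
extract_result
```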
## [1] "Uzziah" "Ozzie" "Ozzy" "Uzziel" "Jazz"
## [6] "Chazz" "Izzy" "Azzam" "Izzac" "Izzak"
## [11] "Fabrizzio" "Jazziel" "Azzan" "Izzaiah" "Muizz"
## [16] "Yazziel"
## [1] "Izzabella" "Jazzlyn" "Jazzlynn" "Lizzie" "Izzy"
## [6] "Lizzy" "Mazzy" "Izzabelle" "Jazzmine" "Jazzmyn"
## [11] "Jazzelle" "Jazzmin" "Izzah" "Jazzalyn" "Jazzmyne"
## [16] "Izzabell" "Jazz" "Mazzie" "Alyzza" "Izza"
## [21] "Izzie" "Jazzlene" "Lizzeth" "Jazzalynn" "Jazzy"
## [26] "Alizzon" "Elizzabeth" "Jazzilyn" "Jazzlynne" "Jizzelle"
## [31] "Izzabel" "Izzabellah" "Izzibella" "Jazzabella" "Jazzabelle"
## [36] "Jazzel" "Jazzie" "Jazzlin" "Jazzlyne" "Aizza"
## [41] "Brizza" "Ezzah" "Fizza" "Izzybella" "Rozzlyn"
## [1] "Unique" "Uma" "Unknown" "Una" "Uriah" "Ursula" "Unity"
## [8] "Umaiza" "Urvi" "Ulyana" "Ula" "Udy" "Urwa" "Ulani"
## [15] "Umaima" "Umme" "Ugochi" "Ulyssa" "Umika" "Uriyah" "Ubah"
## [22] "Umaira" "Umi" "Ume" "Urenna" "Uriel" "Urijah" "Uyen"
## [1] "Umaiza"
Only one girls’ name that starts with “U” and contains a “z”. Have you ever met an “Umaiza”?
str_count()
If you count the occurrences of "pepper" in your pizzas, you’ll find no occurrences in the first, and one each in the second and third:
## [1] 0 1 1
Perhaps a little more interesting is to count how many "e"s occur in each order:
## [1] 3 2 5
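Both counts can be reproduced with the assumed pizzas vector:

```r
library(stringr)

pizzas <- c("cheese", "pepperoni", "sausage and green peppers")

# Number of times "pepper" occurs in each string
pepper_counts <- str_count(pizzas, pattern = fixed("pepper"))
pepper_counts

# Number of "e"s in each string
e_counts <- str_count(pizzas, pattern = fixed("e"))
e_counts
```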
## [1] "Aaradhana"
str_split()
In this exercise you’ll pull apart a date range, something like "23.01.2017 - 29.01.2017", into separate variables for the start of the range, "23.01.2017", and the end of the range, "29.01.2017".
If the simplify argument is FALSE (the default) you’ll get back a list of the same length as the input vector. More commonly, you’ll want to pull out the first piece (or second piece etc.) from every element, which is easier if you specify simplify = TRUE and get a matrix as output.
## [[1]]
## [1] "23.01.2017" "29.01.2017"
##
## [[2]]
## [1] "30.01.2017" "06.02.2017"
## [,1] [,2]
## [1,] "23.01.2017" "29.01.2017"
## [2,] "30.01.2017" "06.02.2017"
## [,1] [,2] [,3]
## [1,] "23" "01" "2017"
## [2,] "30" "01" "2017"
Use the simplify = TRUE argument when you want to split each string into the same number of pieces.
Generally, specifying simplify = TRUE will give you output that is easier to work with, but you’ll always get n pieces (even if some are empty, "").
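A sketch of the date-range split, with the date_ranges vector assumed from the output above:

```r
library(stringr)

date_ranges <- c("23.01.2017 - 29.01.2017", "30.01.2017 - 06.02.2017")

# Default: a list with one character vector per input string
str_split(date_ranges, pattern = fixed(" - "))

# simplify = TRUE: a matrix with one row per input string
split_dates <- str_split(date_ranges, pattern = fixed(" - "), n = 2, simplify = TRUE)
start_dates <- split_dates[, 1]

# Split the start dates into day, month and year
start_parts <- str_split(start_dates, pattern = fixed("."), n = 3, simplify = TRUE)
start_parts

# With uneven input, the shorter rows are padded with ""
uneven <- str_split(c("a b", "a b c"), pattern = fixed(" "), simplify = TRUE)
uneven
```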
Sometimes, you want to know how many pieces a string can be split into, or you want to do something with every piece before moving to a simpler structure. This is a situation where you don’t want to simplify and you’ll have to process the output with something like lapply().
As an example, you’ll be performing some simple text statistics on your lines from Alice’s Adventures in Wonderland from Chapter 1. Your goal will be to calculate how many words are in each line, and the average length of words in each line.
To do these calculations, you’ll need to split the lines into words. One way to break a sentence into words is to split on an empty space " ". This is a little naive because, for example, it wouldn’t pick up words separated by a newline escape sequence like in "two\nwords", but since this situation doesn’t occur in your lines, it will do.
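The approach can be sketched on two short strings. The vector alice_lines is a hypothetical stand-in for the full lines:

```r
library(stringr)

alice_lines <- c("No room! No room!", "they cried")

# Split each line into words; the result is a list
words <- str_split(alice_lines, pattern = fixed(" "))

# Number of words in each line
word_counts <- lapply(words, length)
word_counts

# Length of every word, then the average per line
word_lengths <- lapply(words, str_length)
avg_lengths <- lapply(word_lengths, mean)
avg_lengths
```

Punctuation is still counted as part of each word here, which is the limitation discussed next.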
## [[1]]
## [1] 18
##
## [[2]]
## [1] 12
##
## [[3]]
## [1] 21
## [[1]]
## [1] 3.944444
##
## [[2]]
## [1] 4.333333
##
## [[3]]
## [1] 4.428571
The word lengths aren’t quite right because you were including some punctuation symbols. One way to deal with that is to replace them first with
str_replace()
Sometimes, it’s easier to just replace the parts you don’t want with an empty string "". This is also a common strategy to clean strings up, for example, to remove unwanted punctuation or white space.
In this exercise you’ll pull out some numbers by replacing the part of the string that isn’t a number. You’ll also play with the format of some phone numbers. Pay close attention to the difference between str_replace() and str_replace_all().
ids <- c("ID#: 192", "ID#: 118", "ID#: 001")
# Replace "ID#: " with ""
id_nums <- str_replace(ids, "ID#: ", "")
# Turn id_nums into numbers
id_ints <- as.numeric(id_nums)
# Some (fake) phone numbers
phone_numbers <- c("510-555-0123", "541-555-0167")
# Use str_replace() to replace "-" with " "
str_replace(phone_numbers, "-", " ")
## [1] "510 555-0123" "541 555-0167"
## [1] "510 555 0123" "541 555 0167"
## [1] "510.555.0123" "541.555.0167"
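The three outputs above correspond to calls along these lines (a sketch; the phone numbers are the ones defined in the code above):

```r
library(stringr)

phone_numbers <- c("510-555-0123", "541-555-0167")

# str_replace() only replaces the first match in each string
first_only <- str_replace(phone_numbers, "-", " ")
first_only

# str_replace_all() replaces every match
all_spaces <- str_replace_all(phone_numbers, "-", " ")
all_spaces

# The replacement can be any string, here a literal dot
all_dots <- str_replace_all(phone_numbers, "-", ".")
all_dots
```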
You’ve covered a lot of stringr functions in this chapter:
str_c()
str_length()
str_sub()
str_detect()
str_subset()
str_count()
str_split()
str_replace()
As a review we’ve got a few tasks for you to do with some DNA sequences. We’ve put three sequences, corresponding to three genes, from the genome of Yersinia pestis (the bacterium that causes bubonic plague) into the vector genes.
Each string represents a gene, each character a particular nucleotide: Adenine, Cytosine, Guanine or Thymine.
We aren’t going to tell you which function to use. It’s up to you to choose the right one and specify the needed arguments. Good luck!
## [1] 441 462 993
## [1] 118 117 267
## [1] "TTAAGGAACGATCGTACGCATGATAGGGTTTTGCAGTGATATTAGTGTCTCGGTTGACTGGATCTCATCAATAGTCTGGATTTTGTTGATAAGTACCTGCTGCAATGCATCAATGGATTTACACATCACTTTAATAAATATGCTGTAGTGGCCAGTGGTGTAATAGGCCTCAACCACTTCTTCTAAGCTTTCCAATTTTTTCAAGGCGGAAGGGTAATCTTTGGCACTTTTCAAGATTATGCCAATAAAGCAGCAAACGTCGTAACCCAGTTGTTTTGGGTTAACGTGTACACAAGCTGCGGTAATGATCCCTGCTTGCCGCATCTTTTCTACTCTTACATGAATAGTTCCGGGGCTAACAGCGAGGTTTTTGGCTAATTCAGCATAGGGTGTGCGTGCATTTTCCATTAATGCTTTCAGGATGCTGCGATCGAGATTATCGATCTGATAAATTTCACTCAT"
## [1] "TT_G_GT___TT__TCC__TCTTTG_CCC___TCTCTGCTGG_TCCTCTGGT_TTTC_TGTTGG_TG_CGTC__TTTCT__T_TTTC_CCC__CCGTTG_GC_CCTTGTGCG_TC__TTGTTG_TCC_GTTTT_TG_TTGC_CCGC_G___GTGTC_T_TTCTG_GCTGCCT___CC__CCGCCCC___GCGT_CTTGGG_T___TC_GGCTTTTGTTGTTCG_TCTGTTCT__T__TGGCTGC__GTT_TC_GGT_G_TCCCCGGC_CC_TG_GTGG_TGTC_CG_TT__CC_C_GGCC_TTC_GCGT__GTTCGTCC__CTCTGGGCC_TG__GT_TTTCTGT_G____CCC_GCTTCTTCT__TTT_TCCGCT___TGTTC_GC__C_T_TTC_GC_CT_CC__GCGT_CTGCC_CTT_TC__CGTT_TGTC_GCC_T"
## [2] "TT__GG__CG_TCGT_CGC_TG_T_GGGTTTTGC_GTG_T_TT_GTGTCTCGGTTG_CTGG_TCTC_TC__T_GTCTGG_TTTTGTTG_T__GT_CCTGCTGC__TGC_TC__TGG_TTT_C_C_TC_CTTT__T___T_TGCTGT_GTGGCC_GTGGTGT__T_GGCCTC__CC_CTTCTTCT__GCTTTCC__TTTTTTC__GGCGG__GGGT__TCTTTGGC_CTTTTC__G_TT_TGCC__T___GC_GC___CGTCGT__CCC_GTTGTTTTGGGTT__CGTGT_C_C__GCTGCGGT__TG_TCCCTGCTTGCCGC_TCTTTTCT_CTCTT_C_TG__T_GTTCCGGGGCT__C_GCG_GGTTTTTGGCT__TTC_GC_T_GGGTGTGCGTGC_TTTTCC_TT__TGCTTTC_GG_TGCTGCG_TCG_G_TT_TCG_TCTG_T___TTTC_CTC_T"
## [3] "_TG______C__TTT_TCC_____C__C__C___TC_GCTTCGT____TC_TTCTTTTCCCGCC__TT_G_GC__C__CTTGGCTTG_TCG__GTCC_GGCTCCT_TTTTG_GCCGTGTGGGTG_TGG__CCC__G_T__CCTTTCTGGTTCTG_G___GCGGT_C_GGT____GTT__GTC_TTGCCGG_TTC__CTTTTG__GTTGT_C_TTC_TT_GCG__GTGG___CGT____CCTT_GGGCGTTTTG_TTTTGGTGCTG_CC__GGGGTGT_T_CCC_T_TG___GC_TTGCGCCC_G_TG__G_TCGCCTG_GTGCT_TTC_TTCTGT_T_TGT_G_TC_GTGGG_TTGGG__CGGGTT_TGGGGG_CGGTG__CGT__CCTGGCTT_CCTG___TCG_CTGTT__C__G_TTT_TGC_GCG_TT___G___CTG__GCGGCG_TC_GTGCTG_GTTTGGTGTG__GCCTTTCCTGCCGG_TC_T_TTC_GTTT_TCC_C_GTG___GCCTGCGGGCC_G_TTCCCTG_TTT_G_TGCT___GGCCGTG__CGTGC__TTGCC___G_GTT_GGTGCTGTCTTCCTT_T_GGG_TTGGTGGC___TTGGC_G_TGGTC__TCCC_TG_TGTTCGTGCGCC_G_TT_TG_TG_TTGG_CCTCTCCG_GTGCGG__GGTTTCTCTGG_TT___CGGCG_C_TT_TTGTCTGG__CCC__T_TTGG__G_TGCCTTTG_G_T_TCTTCT_TGGG__TTCGTGTTG_TGCCG__GCTCTT__GCGTC_GTT_GCCCTG_CTGGCG_TG__G_CCGCTTGG__CTGG__TGGC_TC__TC_CTGTTGCGCGGTG___TGCC_C___CT_TCGGGGG_GGT_TTGGTC_GTCCCGCTT_GTG_TGTT_TTGCTGC_G___C__C_T_TTGGTC_GGTGC__TGTGGTGTTTGGGGCCCTG___TC_GCG_G___GTTG_TGGCCTGCTGT__"
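One possible set of calls that produces output of this shape is sketched below on two made-up sequences. The vector toy_genes is hypothetical, and the choice of functions is an assumption about the tasks, not the course solution:

```r
library(stringr)

toy_genes <- c("ATGTTTTTTAAC", "ATGCGA")

# Number of nucleotides in each sequence
gene_lengths <- str_length(toy_genes)

# Number of adenines in each sequence
a_counts <- str_count(toy_genes, pattern = fixed("A"))

# Sequences containing a run of six thymines
t_runs <- str_subset(toy_genes, pattern = fixed("TTTTTT"))

# Replace every adenine with an underscore
masked <- str_replace_all(toy_genes, pattern = fixed("A"), replacement = "_")
```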
As the final exercise we want to expose you to the power of combining operations. You’ll complete two tasks:
You’ll turn a vector of full names, like “Bruce Wayne”, into abbreviated names like “B. Wayne”. This requires combining str_split(), str_sub() and str_c().
You’ll compare how many boy names end in “ee” compared to girl names. This requires combining str_sub() with str_detect() along with the base function table().
# --- Task 1 ----
# Define some full names
names <- c("Diana Prince", "Clark Kent")
# Split into first and last names
names_split <- str_split(names, pattern = fixed(" "), simplify = TRUE)
# Extract the first letter in the first name
abb_first <- str_sub(names_split[, 1], 1, 1)
# Combine the first letter, ". " and the last name
str_c(abb_first, ". ", names_split[, 2])
## [1] "D. Prince" "C. Kent"
# --- Task 2 ----
# Use all names in babynames_2014
all_names <- babynames_2014$name
# Get the last two letters of all_names
last_two_letters <- str_sub(all_names, -2, -1)
# Does the name end in "ee"?
ends_in_ee <- str_detect(last_two_letters, pattern = fixed("ee"))
# Extract rows and "sex" column
sex <- babynames_2014$sex[ends_in_ee]
# Display result as a table
table(sex)
## sex
## F M
## 572 84
In this chapter you’ll learn about regular expressions, a language for describing patterns in strings. By combining regular expressions with the stringr functions you’ll greatly increase your power to manipulate strings.
rebus provides START and END shortcuts to specify regular expressions that match the start and end of the string. These are also known as anchors. You can try it out just by typing
START
You’ll see the output <regex> ^. The <regex> denotes this is a special regex object and it has the value ^. ^ is the character used in the regular expression language to denote the start of a string.
The special operator provided by rebus, %R%, allows you to compose complicated regular expressions from simple pieces. When you are reading rebus code, think of %R% as “then”. For example, you could combine START with c,
START %R% "c"
to match the pattern “the start of string then a c”, or in other words: strings that start with c. In rebus, if you want to match a specific character, or a specific sequence of characters, you simply specify them as a string, i.e. surround them with quotes (").
## <regex> $
For that last example, rebus also provides the function exactly(x), which is a shortcut for START %R% x %R% END that matches only if the string is exactly x.
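A sketch of the anchors in use with str_detect(). The test vector x is an assumption chosen for illustration:

```r
library(stringr)
library(rebus)

x <- c("cat", "coat", "scotland", "tic toc")

# Strings that start with "c"
starts_with_c <- str_detect(x, pattern = START %R% "c")
starts_with_c

# Strings that end with "at"
ends_with_at <- str_detect(x, pattern = "at" %R% END)
ends_with_at

# Strings that are exactly "cat"
is_cat <- str_detect(x, pattern = exactly("cat"))
is_cat
```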
In a regular expression you can use a wildcard to match a single character, no matter what the character is. In rebus it is specified with ANY_CHAR.
## <regex> .
For example, "c" %R% ANY_CHAR %R% "t" will look for patterns like "c_t" where the blank can be any character.
Where would the matches to "c" %R% ANY_CHAR %R% "t" be?
Test your intuition by running:
Notice that ANY_CHAR will match a space character (c t in tic toc). It will also match numbers or punctuation symbols, but ANY_CHAR will only ever match one character, which is why we get no match in coat.
You can pass a regular expression as the pattern argument to any stringr function that has the pattern argument.
It now also makes sense to add str_extract() to your repertoire. It returns just the part of the string that matched the pattern:
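The wildcard pattern from above can be passed to both functions. The vector x is an assumption that includes the strings mentioned in the text:

```r
library(stringr)
library(rebus)

x <- c("cat", "coat", "scotland", "tic toc")
pattern <- "c" %R% ANY_CHAR %R% "t"

# Which strings contain a "c", any single character, then a "t"?
detected <- str_detect(x, pattern = pattern)
detected

# Which part of each string matched?
extracted <- str_extract(x, pattern = pattern)
extracted
```

"coat" has two characters between the c and the t, so it yields NA.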
## [1] 96
## part_with_q
## qa qe qi qm qo qu
## 1 1 2 2 1 89
## count_of_q
## 0 1
## 13930 96
## [1] 0.006844432
The rebus::or() function allows us to specify a set of alternatives, which may be single characters or character strings, to be matched. Each alternative is passed as a separate argument.
For example, or("grey", "gray") allows us to detect either the American or British spelling:
Since these two words only differ by one character you could equivalently specify this match with "gr" %R% or("e", "a") %R% "y", that is “a gr followed by an e or an a, then a y”.
In regular expressions a character class is a way of specifying “match one (and only one) of the following characters”. In rebus you can specify the set of allowable characters using the function char_class().
This is another way you could specify an alternate spelling, for example, specifying “a gr followed by either an a or e, followed by a y”:
A negated character class matches “any single character that isn’t one of the following”, and in rebus is specified with negated_char_class().
Unlike in other places in a regular expression you don’t need to escape characters that might otherwise have a special meaning inside character classes. If you want to match . you can include . directly, e.g. char_class("."). Matching a - is a bit trickier. If you need to do it, just make sure it comes first in the character class.
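The three ways of writing the grey/gray pattern, plus the vowel classes, might be sketched as follows. The test strings are assumptions for illustration:

```r
library(stringr)
library(rebus)

x <- c("grey sky", "gray elephant", "groy")

# Alternation over whole words
with_or <- str_detect(x, pattern = or("grey", "gray"))

# Alternation over the one character that differs
with_inner_or <- str_detect(x, pattern = "gr" %R% or("e", "a") %R% "y")

# Character class for the one character that differs
with_class <- str_detect(x, pattern = "gr" %R% char_class("ae") %R% "y")

# A class of vowels and its negation
vowels <- char_class("aeiouAEIOU")
not_vowels <- negated_char_class("aeiouAEIOU")
n_vowels <- str_count("Alice", pattern = vowels)
n_other <- str_count("Alice", pattern = not_vowels)
```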
## <regex> [aeiouAEIOU]
## [1] 2.385356
## [1] 0.4000153
The names in boy_names are on average about 40% vowels.
The rebus functions one_or_more(), zero_or_more() and optional() can be used to wrap parts of a regular expression to allow a pattern to match a variable number of times.
Take our vowels pattern from the last exercise. You can pass it to one_or_more() to create the pattern that matches “one or more vowels”. Take a look with these interjections:
You’ll see we can match the single o in ow, the double o in ooh and the string of es followed by the a in yeeeah, but nothing in shh because there isn’t a single vowel.
In contrast, zero_or_more() will match even if there isn’t an occurrence.
Since both yeeeah and shh start without a vowel, they match “zero vowels” at the very start of the string. A regular expression engine returns the first match it finds when scanning from the left, so it looks no further and returns that empty match.
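A sketch of both quantifiers, with the interjections vector assumed from the words mentioned in the text:

```r
library(stringr)
library(rebus)

interjections <- c("ow", "ooh", "yeeeah!", "shh")
vowels <- char_class("aeiouAEIOU")

# One or more vowels: NA when a string has no vowel at all
one_plus <- str_extract(interjections, pattern = one_or_more(vowels))
one_plus

# Zero or more vowels: an empty match at the start counts as a match
zero_plus <- str_extract(interjections, pattern = zero_or_more(vowels))
zero_plus
```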
## [1] "555-555-0191" NA "(555) 555 0191" "555.555.0191"
## [[1]]
## [1] "555-555-0191"
##
## [[2]]
## character(0)
##
## [[3]]
## [1] "(555) 555 0191"
##
## [[4]]
## [1] "555.555.0191" "555.555.0192"
## [1] "19YOM" "31 YOF" "82 YOM" "33 YOF" "10YOM" "53 YO F" "13 MOF"
## [8] "14YR M" "55YOM" "5 YOM"
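library(rebus)
# Age/gender strings, copied from the output above
age_gender <- c("19YOM", "31 YOF", "82 YOM", "33 YOF", "10YOM", "53 YO F", "13 MOF",
  "14YR M", "55YOM", "5 YOM")
# Pattern for the age: one or two digits
age <- dgt(1, 2)
# Pattern for the unit (assumed definition, consistent with the output above)
unit <- optional(SPC) %R% or("YO", "YR", "MO")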
# Extract age and make numeric
ages_numeric <- as.numeric(str_extract(age_gender, age))
# Replace age and units with ""
genders <- str_replace(age_gender,
pattern = age %R% unit,
replacement = "")
# Replace extra spaces
genders_clean <- str_replace_all(genders,
pattern = SPC,
replacement = "")
# Extract units
time_units <- str_extract(age_gender, pattern = unit)
# Extract first word character
time_units_clean <- str_extract(time_units, pattern = WRD)
# Turn ages in months to years
ages_years <- ifelse(time_units_clean == "Y", ages_numeric, ages_numeric/12)
Now for two advanced ways to use regular expressions along with stringr: selecting parts of a match (a.k.a. capturing) and referring back to parts of a match (a.k.a. back-referencing). You’ll also learn to deal with strings or patterns that contain Unicode characters (e.g. é).
In rebus, to denote a part of a regular expression you want to capture, you surround it with the function capture(). For example, a simple pattern to match an email address might be,
If you want to capture the part before the @, you simply wrap that part of the regular expression in capture():
The part of the string that matches hasn’t changed, but if we pull out the match with str_match() we get access to the captured piece:
## [,1] [,2]
## [1,] "wolverine@xmen.com" "wolverine"
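A pattern that produces this output could be written as below. The exact pattern used in the course is not shown here, so this is a reconstruction based on the later exercise code:

```r
library(stringr)
library(rebus)

# Word characters, an @, word characters, a dot, word characters;
# only the part before the @ is captured
email <- capture(one_or_more(WRD)) %R%
  "@" %R% one_or_more(WRD) %R%
  DOT %R% one_or_more(WRD)

# Column 1 is the full match, column 2 is the captured piece
parts <- str_match("(wolverine@xmen.com)", pattern = email)
parts
```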
hero_contacts <- c("(wolverine@xmen.com)", "wonderwoman@justiceleague.org", "thor@avengers.com")
# Capture part between @ and . and after .
email <- capture(one_or_more(WRD)) %R%
"@" %R% capture(one_or_more(WRD)) %R%
DOT %R% capture(one_or_more(WRD))
# Check match hasn't changed
str_view(hero_contacts, pattern = email)
## [,1] [,2] [,3] [,4]
## [1,] "wolverine@xmen.com" "wolverine" "xmen" "com"
## [2,] "wonderwoman@justiceleague.org" "wonderwoman" "justiceleague" "org"
## [3,] "thor@avengers.com" "thor" "avengers" "com"
## [1] "xmen" "justiceleague" "avengers"
Actually, detecting an email address can be really hard; see this discussion for more details.
## [1] "Call me at 555-555-0191"
## [2] "123 Main St"
## [3] "(555) 555 0191"
## [4] "Phone: 555.555.0191 Mobile: 555.555.0192"
# Add capture() to get digit parts
phone_pattern <- capture(three_digits) %R% zero_or_more(separator) %R%
capture(three_digits) %R% zero_or_more(separator) %R%
capture(four_digits)
# Pull out the parts with str_match()
phone_numbers <- str_match(contact, phone_pattern)
# Put them back together
str_c(
"(",
phone_numbers[, 2],
") ",
phone_numbers[, 3],
"-",
phone_numbers[, 4])
## [1] "(555) 555-0191" NA "(555) 555-0191" "(555) 555-0191"
If you wanted to extract beyond the first phone number, e.g. the second phone number in the last string, you could use str_match_all(). But, like str_split(), it will return a list with one component for each input string, and you’ll need to use lapply() to handle the result.
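A sketch of that approach. The definitions of three_digits, four_digits and separator are assumptions (they are not shown in this section), and format_numbers is a hypothetical helper:

```r
library(stringr)
library(rebus)

contact <- c("Call me at 555-555-0191", "123 Main St",
  "(555) 555 0191", "Phone: 555.555.0191 Mobile: 555.555.0192")

# Assumed building blocks for the phone pattern
three_digits <- DGT %R% DGT %R% DGT
four_digits <- three_digits %R% DGT
separator <- char_class("-.() ")

phone_pattern <- capture(three_digits) %R% zero_or_more(separator) %R%
  capture(three_digits) %R% zero_or_more(separator) %R%
  capture(four_digits)

# One matrix per input string, one row per phone number found
all_matches <- str_match_all(contact, pattern = phone_pattern)

# Hypothetical helper: rebuild every number in a match matrix
format_numbers <- function(m) str_c("(", m[, 2], ") ", m[, 3], "-", m[, 4])
formatted <- lapply(all_matches, format_numbers)
formatted
```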
## [1] "19YOM-SHOULDER STRAIN-WAS TACKLED WHILE PLAYING FOOTBALL W/ FRIENDS "
## [2] "31 YOF FELL FROM TOILET HITITNG HEAD SUSTAINING A CHI "
## [3] "ANKLE STR. 82 YOM STRAINED ANKLE GETTING OUT OF BED "
## [4] "TRIPPED OVER CAT AND LANDED ON HARDWOOD FLOOR. LACERATION ELBOW, LEFT. 33 YOF*"
## [5] "10YOM CUT THUMB ON METAL TRASH CAN DX AVULSION OF SKIN OF THUMB "
## [6] "53 YO F TRIPPED ON CARPET AT HOME. DX HIP CONTUSION "
## [7] "13 MOF TRYING TO STAND UP HOLDING ONTO BED FELL AND HIT FOREHEAD ON RADIATOR DX LACERATION"
## [8] "14YR M PLAYING FOOTBALL; DX KNEE SPRAIN "
## [9] "55YOM RIDER OF A BICYCLE AND FELL OFF SUSTAINED A CONTUSION TO KNEE "
## [10] "5 YOM ROLLING ON FLOOR DOING A SOMERSAULT AND SUSTAINED A CERVICAL STRA IN"
## [,1] [,2] [,3] [,4]
## [1,] "19YOM" "19" "YO" "M"
## [2,] "31 YOF" "31" "YO" "F"
## [3,] "82 YOM" "82" "YO" "M"
## [4,] "33 YOF" "33" "YO" "F"
## [5,] "10YOM" "10" "YO" "M"
## [6,] "53 YO F" "53" "YO" "F"
## [7,] "13 MOF" "13" "MO" "F"
## [8,] "14YR M" "14" "YR" "M"
## [9,] "55YOM" "55" "YO" "M"
## [10,] "5 YOM" "5" "YO" "M"
## [,1] [,2] [,3] [,4]
## [1,] "19YOM" "19" "Y" "M"
## [2,] "31 YOF" "31" "Y" "F"
## [3,] "82 YOM" "82" "Y" "M"
## [4,] "33 YOF" "33" "Y" "F"
## [5,] "10YOM" "10" "Y" "M"
## [6,] "53 YO F" "53" "Y" "F"
## [7,] "13 MOF" "13" "M" "F"
## [8,] "14YR M" "14" "Y" "M"
## [9,] "55YOM" "55" "Y" "M"
## [10,] "5 YOM" "5" "Y" "M"
The combination of capture() and str_match() is powerful for extracting pieces of text.
Backreferences can be useful in matching because they allow you to find repeated patterns or words. Using a backreference requires two things: you need to capture() the part of the pattern you want to reference, and then you refer to it with REF1.
Take a look at this pattern: capture(LOWER) %R% REF1. It matches and captures any lower case character, then requires the captured character to follow immediately: it detects repeated characters regardless of what character is repeated. To see it in action try this:
If you capture more than one thing you can refer to them with REF2, REF3 etc. up to REF9, counting the captures from the left of the pattern.
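The repeated-letter pattern can be tried on a few words. The vector x is an assumption chosen for illustration:

```r
library(stringr)
library(rebus)

x <- c("hello", "sweet", "kitten", "cat")

# Any lower case letter, immediately followed by the same letter
repeated_letter <- capture(LOWER) %R% REF1

has_repeat <- str_detect(x, pattern = repeated_letter)
has_repeat

which_repeat <- str_extract(x, pattern = repeated_letter)
which_repeat
```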
In addition to matching repeated values, backreferences can also be used for replacement.
str_replace() takes three arguments: string, a vector of strings to do the replacements in; pattern, which identifies the parts of strings to replace; and replacement, the thing to use as a replacement.
## [1] "Call me at 555-555-0191"
## [2] "123 Main St"
## [3] "(555) 555 0191"
## [4] "Phone: 555.555.0191 Mobile: 555.555.0192"
## [1] "Call me at X55-555-0191"
## [2] "X23 Main St"
## [3] "(X55) 555 0191"
## [4] "Phone: X55.555.0191 Mobile: 555.555.0192"
## [1] "Call me at XXX-XXX-XXXX"
## [2] "XXX Main St"
## [3] "(XXX) XXX XXXX"
## [4] "Phone: XXX.XXX.XXXX Mobile: XXX.XXX.XXXX"
## [1] "Call me at XXX-XXX-XXXX"
## [2] "... Main St"
## [3] "(***) *** ****"
## [4] "Phone: ___.___.____ Mobile: ___.___.____"
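The three sets of replaced output above are consistent with calls like the following. This is a sketch rather than the original exercise code: the variable names are assumptions and the digit pattern is written as a plain regular expression.

```r
library(stringr)

phone_numbers <- c(
  "Call me at 555-555-0191",
  "123 Main St",
  "(555) 555 0191",
  "Phone: 555.555.0191 Mobile: 555.555.0192"
)

# Replace only the first digit in each string
first_only <- str_replace(phone_numbers, "\\d", "X")

# Replace every digit in each string
all_digits <- str_replace_all(phone_numbers, "\\d", "X")

# replacement is vectorised: one replacement character per input string
per_string <- str_replace_all(phone_numbers, "\\d", c("X", ".", "*", "_"))
```

The difference between the first two calls is only the function: str_replace() stops after the first match in each string, while str_replace_all() keeps going.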
Using "" for the replacement value is a great way to cut unwanted bits from a string.
The replacement argument to str_replace() can also include backreferences. This works just like specifying patterns with backreferences, except the capture happens in the pattern argument, and the backreference is used in the replacement argument.
## [1] "hhello" "ssweet" "kkitten"
capture(ANY_CHAR) will match the first character no matter what it is. Then the replacement str_c(REF1, REF1) combines the captured character with itself, in effect doubling the first letter of each string.
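Put together, the call described above might look like this, assuming stringr and rebus are installed. The input vector is inferred from the output shown.

```r
library(stringr)
library(rebus)

x <- c("hello", "sweet", "kitten")

# Capture the first character, then write it back twice
doubled <- str_replace(
  x,
  pattern = capture(ANY_CHAR),
  replacement = str_c(REF1, REF1)
)
```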
## [1] "19YOM-SHOULDER STRAIN-WAS TACKLED WHILE CARELESSLY PLAYING FOOTBALL W/ FRIENDS "
## [2] "31 YOF FELL FROM TOILET HITITNG HEAD CARELESSLY SUSTAINING A CHI "
## [3] "ANKLE STR. 82 YOM STRAINED ANKLE CARELESSLY GETTING OUT OF BED "
## [4] "TRIPPED OVER CAT AND LANDED ON HARDWOOD FLOOR. LACERATION ELBOW, LEFT. 33 YOF*"
## [5] "10YOM CUT THUMB ON METAL TRASH CAN DX AVULSION OF SKIN OF THUMB "
## [6] "53 YO F TRIPPED ON CARPET AT HOME. DX HIP CONTUSION "
## [7] "13 MOF CARELESSLY TRYING TO STAND UP HOLDING ONTO BED FELL AND HIT FOREHEAD ON RADIATOR DX LACERATION"
## [8] "14YR M CARELESSLY PLAYING FOOTBALL; DX KNEE SPRAIN "
## [9] "55YOM RIDER OF A BICYCLE AND FELL OFF SUSTAINED A CONTUSION TO KNEE "
## [10] "5 YOM CARELESSLY ROLLING ON FLOOR DOING A SOMERSAULT AND SUSTAINED A CERVICAL STRA IN"
## [1] "19YOM-SHOULDER STRAIN-WAS TACKLED WHILE NEVER PLAYING FOOTBALL W/ FRIENDS "
## [2] "31 YOF FELL FROM TOILET HITITNG HEAD HAPPILY SUSTAINING A CHI "
## [3] "ANKLE STR. 82 YOM STRAINED ANKLE AFTERWARDS GETTING OUT OF BED "
## [4] "TRIPPED OVER CAT AND LANDED ON HARDWOOD FLOOR. LACERATION ELBOW, LEFT. 33 YOF*"
## [5] "10YOM CUT THUMB ON METAL TRASH CAN DX AVULSION OF SKIN OF THUMB "
## [6] "53 YO F TRIPPED ON CARPET AT HOME. DX HIP CONTUSION "
## [7] "13 MOF TRUTHFULLY TRYING TO STAND UP HOLDING ONTO BED FELL AND HIT FOREHEAD ON RADIATOR DX LACERATION"
## [8] "14YR M FERVENTLY PLAYING FOOTBALL; DX KNEE SPRAIN "
## [9] "55YOM RIDER OF A BICYCLE AND FELL OFF SUSTAINED A CONTUSION TO KNEE "
## [10] "5 YOM FEROCIOUSLY ROLLING ON FLOOR DOING A SOMERSAULT AND SUSTAINED A CERVICAL STRA IN"
Replacement combined with backreferences can be really useful for reformatting text data.
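A small sketch of the idea behind the adverb output above, using one narrative and a plain regular expression. The pattern and variable names are assumptions, not the original exercise code.

```r
library(stringr)

# One narrative in the style of the output above
narrative <- "14YR M PLAYING FOOTBALL; DX KNEE SPRAIN"

# Capture the first word ending in "ING" and put an adverb in front of it;
# "\\1" in the replacement stands for the captured word
with_adverb <- str_replace(narrative, "(\\w+ING)", "CARELESSLY \\1")
```

Because the captured word is written back by the replacement, nothing from the original text is lost; the adverb is simply inserted in front of it.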
http://www.fileformat.info/info/unicode/char/search.htm
## [1] "μ"
## 👏
## 😂
## 😏
## [1] "61"
## [1] "3bc"
## [1] "1f600"
## 😏
## [1] "1f60f"
## 😏
Things can get tricky when some characters can be specified two ways. For example è, an e with a grave accent, can be specified either with the single code point \u00e8 or the combination of a \u0065 and a combining grave accent \u0300. They look the same:
## è
## è
But, specifying the single code point only matches that version:
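A minimal sketch of that behaviour, assuming stringr is installed. The variable names are chosen here for illustration.

```r
library(stringr)

# The same displayed character, built two ways
composed   <- "\u00e8"        # single code point
decomposed <- "\u0065\u0300"  # e followed by a combining grave accent

x <- c(composed, decomposed)

# A pattern written with the single code point finds only the first form
matches_composed <- str_detect(x, "\u00e8")
```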
The stringi package has functions for converting between the two forms. stri_trans_nfc() composes characters with combining accents into a single character. stri_trans_nfd() decomposes characters with accents into separate letter and accent characters. You can see how the characters differ by looking at the hexadecimal codes.
## [1] "065" "300"
## [1] "e8"
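One way to produce hexadecimal codes like those above is to combine the stringi conversions with base R's utf8ToInt() and as.hexmode(). This is a sketch with assumed variable names.

```r
library(stringi)

composed   <- "\u00e8"
decomposed <- "\u0065\u0300"

# Decompose the single code point into letter + combining accent
nfd_codes <- as.hexmode(utf8ToInt(stri_trans_nfd(composed)))

# Compose letter + combining accent into the single code point
nfc_codes <- as.hexmode(utf8ToInt(stri_trans_nfc(decomposed)))
```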
In Unicode, an accent is known as a diacritic Unicode Property, and you can match it using the rebus value UP_DIACRITIC.
## [1] "Nguyễn Nhạc" "Nguyễn Huệ" "Nguyễn Quang Toản"
## [1] "Nguyễn Nhạc" "Nguyễn Huệ" "Nguyễn Quang Toản"
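As a sketch of how UP_DIACRITIC might be used, the example below first separates accents from their letters and then strips them. The stripping step is an illustration added here, not something the text above prescribes, and the name is a made-up input. stringr, stringi and rebus are assumed to be installed.

```r
library(stringr)
library(stringi)
library(rebus)

name <- "Ad\u00e8le"

# Split accented characters into base letter + combining accent,
# so the accent becomes a separate character that can be matched
separated <- stri_trans_nfd(name)

# Remove every character that has the Diacritic property
plain <- str_replace_all(separated, UP_DIACRITIC, "")
```

Decomposing first matters: in the composed form the accent is not a separate character, so there is nothing for UP_DIACRITIC to match on its own.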
A related problem is matching a single character. You’ve used ANY_CHAR to do this up until now, but it will only match a character represented by a single code point. Take these three names:
## Adele
## Adèle
## Adèle
They look similar, but this regular expression only matches two of them:
because in the third name è is represented by two code points. The Unicode standard has a concept of a grapheme that represents a display character, but may be composed of many code points. To match any grapheme you can use GRAPHEME.
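The contrast between ANY_CHAR and GRAPHEME can be sketched as follows, assuming stringr and rebus are installed. The three names are written with explicit escapes so the composed and decomposed forms are unambiguous.

```r
library(stringr)
library(rebus)

x <- c("Adele", "Ad\u00e8le", "Ad\u0065\u0300le")

# ANY_CHAR consumes exactly one code point between "Ad" and "le"
one_code_point <- str_detect(x, "Ad" %R% ANY_CHAR %R% "le")

# GRAPHEME consumes one displayed character, however many code points it has
one_grapheme <- str_detect(x, "Ad" %R% GRAPHEME %R% "le")
```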
## [1] "Nguyễn Nhạc" "Nguyễn Huệ" "Nguyễn Quang Toản"
Practice your string manipulation skills on a couple of case studies. You’ll also learn a few new skills, reading strings into R and handling problems of case (e.g. A versus a).
library(stringi)
library(stringr)
# Read play in using stri_read_lines()
earnest <- stri_read_lines("datasets/importance-of-being-earnest.txt")
# Detect start and end lines
start <- which(str_detect(earnest, fixed("START OF THE PROJECT")))
end <- which(str_detect(earnest, fixed("END OF THE PROJECT")))
# Get rid of gutenberg intro text
earnest_sub <- earnest[(start + 1):(end - 1)]
# Detect first act
lines_start <- which(str_detect(earnest_sub, fixed("FIRST ACT")))
# Set up index
intro_line_index <- 1:(lines_start - 1)
# Split play into intro and play
intro_text <- earnest_sub[intro_line_index]
play_text <- earnest_sub[-intro_line_index]
# Take a look at the first 20 lines
writeLines(play_text[1:20])
## FIRST ACT
##
##
## SCENE
##
##
## Morning-room in Algernon's flat in Half-Moon Street. The room is
## luxuriously and artistically furnished. The sound of a piano is heard in
## the adjoining room.
##
## [Lane is arranging afternoon tea on the table, and after the music has
## ceased, Algernon enters.]
##
## Algernon. Did you hear what I was playing, Lane?
##
## Lane. I didn't think it polite to listen, sir.
##
## Algernon. I'm sorry for that, for your sake. I don't play
## accurately--any one can play accurately--but I play with wonderful
## expression. As far as the piano is concerned, sentiment is my forte. I
The first thing you might notice when you look at your vector play_text is there are lots of empty lines. They don’t really affect your task so you might want to remove them. The easiest way to find empty strings is to use the stringi function stri_isempty(), which returns a logical vector you can use to subset the non-empty strings:
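A minimal sketch of that subsetting step. The short play_text vector is a hypothetical stand-in for the real one, and play_lines is the name the later code uses for the result.

```r
library(stringi)

# Hypothetical stand-in for play_text
play_text <- c("FIRST ACT", "", "", "SCENE", "", "Lane. Yes, sir.")

# TRUE for elements that are the empty string
empty <- stri_isempty(play_text)

# Keep only the non-empty elements
play_lines <- play_text[!empty]
```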
So, how are you going to find the elements that indicate a character starts their line? Consider the following lines:
## [1] "Algernon. I'm sorry for that, for your sake. I don't play"
## [2] "accurately--any one can play accurately--but I play with wonderful"
## [3] "expression. As far as the piano is concerned, sentiment is my forte. I"
## [4] "keep science for Life."
## [5] "Lane. Yes, sir."
## [6] "Algernon. And, speaking of the science of Life, have you got the"
The first line is for Algernon, the next three strings are continuations of that line, then line 5 is for Lane and line 6 for Algernon.
How about looking for lines that start with a word followed by a full stop (.)?
## [1] "Algernon." "Lane." "Jack." "Cecily." "Ernest."
## [6] "University." "Gwendolen." "July." "Chasuble." "Merriman."
## [11] "Sunday." "Mr." "London." "Cardew." "Opera."
## [16] "Markby." "Oxonian."
Looks like your pattern wasn’t 100% successful. It missed Lady Bracknell, and picked up lines starting with University., July. and a few others.
The pattern “starts with a capital letter, has some other characters then a full stop” wasn’t specific enough. You ended up matching lines that started with things like University., July., London., and you missed characters like Lady Bracknell and Miss Prism.
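To make the failure concrete, here is a sketch of such a pattern run on a few made-up lines. The pattern is written as a plain regular expression and is an assumption about what "a capital letter, some other characters, then a full stop" looks like in code.

```r
library(stringr)

# Hypothetical lines in the style of the play
lines <- c(
  "Algernon. A line spoken by a character.",
  "University. A continuation that happens to start this way.",
  "Lady Bracknell. A line spoken by a two-word name.",
  "accurately--any one can play accurately--but I play with wonderful"
)

# Start of line, a capital letter, more word characters, then a full stop
naive <- "^[A-Z]\\w+\\."

# NA where the line does not start with the pattern
found <- str_extract(lines, naive)
```

The second line is a false positive and the third is a miss: the space in a two-word name breaks the run of word characters before the full stop.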
Let’s take a different approach. You know the characters' names from the play introduction. So, try specifically looking for lines that start with their names. You’ll find the or1() function from the rebus package helpful. It specifies alternatives but rather than each alternative being an argument like in or(), you can pass in a vector of alternatives.
# Create vector of characters
characters <- c("Algernon", "Jack", "Lane", "Cecily", "Gwendolen",
"Chasuble", "Merriman", "Lady Bracknell", "Miss Prism")
# Match start, then character name, then .
pattern_3 <- START %R% or1(characters) %R% DOT
# View matches of pattern_3
str_view(play_lines, pattern = pattern_3, match = TRUE)
## [1] "Algernon." "Lane." "Jack." "Cecily."
## [5] "Gwendolen." "Lady Bracknell." "Miss Prism." "Chasuble."
## [9] "Merriman."
## who
## Algernon. Cecily. Chasuble. Gwendolen.
## 201 154 42 102
## Jack. Lady Bracknell. Lane. Merriman.
## 219 84 21 17
## Miss Prism.
## 41
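Counts like the table above can be built by keeping the lines that match, extracting the name, and tabulating. This is a sketch on a small hypothetical play_lines vector, assuming stringr and rebus are installed. The variable names lines and who are assumptions.

```r
library(stringr)
library(rebus)

# Hypothetical stand-in for play_lines
play_lines <- c(
  "Algernon. First speech.",
  "a continuation line",
  "Lane. Second speech.",
  "Algernon. Third speech.",
  "Lady Bracknell. Fourth speech."
)

characters <- c("Algernon", "Lane", "Lady Bracknell")
pattern_3 <- START %R% or1(characters) %R% DOT

# Keep the lines that open a speech, then pull out who is speaking
lines <- str_subset(play_lines, pattern_3)
who <- str_extract(lines, pattern_3)

# Number of speeches per character
counts <- table(who)
```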
Algernon and Jack get the most lines, each more than ten times as many as Merriman, who has the fewest. If you were looking really closely you might have noticed the pattern didn’t pick up the line Jack and Algernon [Speaking together.] which you really should be counting as a line for both Jack and Algernon. One solution might be to look for these "Speaking together" lines, parse out the characters, and add to your counts.
A simple solution to working with strings in mixed case is to transform them into all lower or all upper case. Depending on your choice, you can then specify your pattern in the same case.
For example, while looking for "cat" finds no matches in the following string,
transforming the string to lower case first ensures all variations match.
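A minimal sketch of both steps, with a made-up vector of mixed-case variants:

```r
library(stringr)

x <- c("Cat", "CAT", "cAt")

# None of the mixed-case variants contain the lower case pattern
raw_hits <- str_detect(x, "cat")

# After lower-casing the input, every variant matches
lower_hits <- str_detect(str_to_lower(x), "cat")
```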
See if you can find the catcidents that also involved dogs. You’ll see a new rebus function called whole_word(). The argument to whole_word() will only match if it occurs as a word on its own, for example whole_word("cat") will match cat in "The cat " and "cat." but not in "caterpillar".
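The sketch below checks the three strings mentioned above and then combines whole_word() with lower-casing to pick out narratives that mention a dog. The catcidents vector here is a short hypothetical stand-in, and stringr and rebus are assumed to be installed.

```r
library(stringr)
library(rebus)

# The three cases from the text
x <- c("The cat ", "cat.", "caterpillar")
cat_word <- str_detect(x, whole_word("cat"))

# Hypothetical mixed-case narratives
catcidents <- c(
  "TRIPPED OVER CAT",
  "walKiNG DOG, dog took OfF aFtER cAt",
  "dogged by a cat"
)

# Lower-case a copy for matching, but subset the original strings
lower <- str_to_lower(catcidents)
with_dog <- catcidents[str_detect(lower, whole_word("dog"))]
```

Matching on the lower-cased copy while subsetting the original keeps the narratives exactly as they were recorded.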
## [1] "79yOf Fractured fingeR tRiPPED ovER cAT ANd fell to FlOOr lAst nIGHT AT HOME*"
## [2] "21 YOF REPORTS SUS LACERATION OF HER LEFT HAND WHEN SHE WAS OPENING A CAN OF CAT FOOD JUST PTA. DX HAND LACERATION%"
## [3] "87YOF TRIPPED OVER CAT, HIT LEG ON STEP. DX LOWER LEG CONTUSION "
## [4] "bLUNT CHest trAUma, R/o RIb fX, R/O CartiLAgE InJ To RIB cAge; 32YOM walKiNG DOG, dog took OfF aFtER cAt,FelL,stRucK CHest oN STepS,hiT rIbS"
## [5] "42YOF TO ER FOR BACK PAIN AFTER PUTTING DOWN SOME CAT LITTER DX: BACK PAIN, SCIATICA"
## [6] "4YOf DOg jUst hAd PUpPieS, Cat TRIED 2 get PuPpIes, pT THru CaT dwn stA Irs, LoST foOTING & FELl down ~12 stePS; MInor hEaD iNJuRY"
## [1] "bLUNT CHest trAUma, R/o RIb fX, R/O CartiLAgE InJ To RIB cAge; 32YOM walKiNG DOG, dog took OfF aFtER cAt,FelL,stRucK CHest oN STepS,hiT rIbS"
## [2] "4YOf DOg jUst hAd PUpPieS, Cat TRIED 2 get PuPpIes, pT THru CaT dwn stA Irs, LoST foOTING & FELl down ~12 stePS; MInor hEaD iNJuRY"
## [3] "unhelmeted 14yof riding her bike with her dog when she saw a cat and sw erved c/o head/shoulder/elbow pain.dx: minor head injury,left shoulder"
## [4] "Rt Shoulder Strain.26Yof Was Walking Dog On Leash And Dot Saw A Cat And Pulled Leash."
## [5] "67 YO F WENT TO WALK DOG, IT STARTED TO CHASE CAT JERKED LEASH PULLED H ER OFF PATIO, FELL HURT ANKLES. DX BILATERAL ANKLE FRACTURES"
## [6] "46yof taking dog outside, dog bent her fingers back on a door. dog jerk ed when saw cat. hand holding leash caught on door jamb/ct hand"
## [7] "PUSHING HER UTD WITH SHOTS DOG AWAY FROM THE CAT'S BOWL&BITTEN TO FINGE R>>PW/DOG BITE"
## [8] "DX R SH PN: 27YOF W/ R SH PN X 5D. STATES WAS YANK' BY HER DOG ON LEASH W DOG RAN AFTER CAT; WORSE' PN SINCE. FULL ROM BUT VERY PAINFUL TO MOVE"
## [9] "39Yof dog pulled her down the stairs while chasing a cat dx: rt ankle inj"
## [10] "44Yof Walking Dog And The Dof Took Off After A Cat And Pulled Pt Down B Y The Leash Strained Neck"
Rather than transforming the input strings, another approach is to specify that the matching should be case insensitive. This is one of the options to the stringr regex() function.
Take our previous example,
To match the pattern cat in a case insensitive way, we wrap our pattern in regex() and specify the argument ignore_case = TRUE,
Notice that the matches retain their original case and any variant of cat matches.
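A minimal sketch of the case-insensitive version, reusing a made-up vector of variants:

```r
library(stringr)

x <- c("Cat", "CAT", "cAt")

# Namespaced so the call is unambiguous whatever else is attached
pattern <- stringr::regex("cat", ignore_case = TRUE)

# The extracted text keeps the case it had in the input
matched <- str_extract(x, pattern)
```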
## character(0)
Finally, you might want to transform strings to a common case. You’ve seen you can use str_to_upper() and str_to_lower(), but there is also str_to_title() which transforms to title case, in which every word starts with a capital letter.
This is another situation where stringi functions offer slightly more functionality than the stringr functions. The stringi function stri_trans_totitle() allows a specification of the type which, by default, is "word", resulting in title case, but can also be "sentence" to give sentence case: only the first word in each sentence is capitalized.
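The three variants can be sketched on a single narrative taken from the output in this section, without its trailing space. The variable names are assumptions.

```r
library(stringr)
library(stringi)

x <- "87YOF TRIPPED OVER CAT, HIT LEG ON STEP. DX LOWER LEG CONTUSION"

# Title case: every word capitalised
title_case <- str_to_title(x)

# The same result from stringi, where type = "word" is the default
title_case_stringi <- stri_trans_totitle(x)

# Sentence case: only the first word of each sentence is capitalised
sentence_case <- stri_trans_totitle(x, type = "sentence")
```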
## 79yOf Fractured fingeR tRiPPED ovER cAT ANd fell to FlOOr lAst nIGHT AT HOME*
## 21 YOF REPORTS SUS LACERATION OF HER LEFT HAND WHEN SHE WAS OPENING A CAN OF CAT FOOD JUST PTA. DX HAND LACERATION%
## 87YOF TRIPPED OVER CAT, HIT LEG ON STEP. DX LOWER LEG CONTUSION
## bLUNT CHest trAUma, R/o RIb fX, R/O CartiLAgE InJ To RIB cAge; 32YOM walKiNG DOG, dog took OfF aFtER cAt,FelL,stRucK CHest oN STepS,hiT rIbS
## 42YOF TO ER FOR BACK PAIN AFTER PUTTING DOWN SOME CAT LITTER DX: BACK PAIN, SCIATICA
## 79Yof Fractured Finger Tripped Over Cat And Fell To Floor Last Night At Home*
## 21 Yof Reports Sus Laceration Of Her Left Hand When She Was Opening A Can Of Cat Food Just Pta. Dx Hand Laceration%
## 87Yof Tripped Over Cat, Hit Leg On Step. Dx Lower Leg Contusion
## Blunt Chest Trauma, R/O Rib Fx, R/O Cartilage Inj To Rib Cage; 32Yom Walking Dog, Dog Took Off After Cat,Fell,Struck Chest On Steps,Hit Ribs
## 42Yof To Er For Back Pain After Putting Down Some Cat Litter Dx: Back Pain, Sciatica
## 79Yof Fractured Finger Tripped Over Cat And Fell To Floor Last Night At Home*
## 21 Yof Reports Sus Laceration Of Her Left Hand When She Was Opening A Can Of Cat Food Just Pta. Dx Hand Laceration%
## 87Yof Tripped Over Cat, Hit Leg On Step. Dx Lower Leg Contusion
## Blunt Chest Trauma, R/O Rib Fx, R/O Cartilage Inj To Rib Cage; 32Yom Walking Dog, Dog Took Off After Cat,Fell,Struck Chest On Steps,Hit Ribs
## 42Yof To Er For Back Pain After Putting Down Some Cat Litter Dx: Back Pain, Sciatica
## 79Yof fractured finger tripped over cat and fell to floor last night at home*
## 21 Yof reports sus laceration of her left hand when she was opening a can of cat food just pta. Dx hand laceration%
## 87Yof tripped over cat, hit leg on step. Dx lower leg contusion
## Blunt chest trauma, r/o rib fx, r/o cartilage inj to rib cage; 32yom walking dog, dog took off after cat,fell,struck chest on steps,hit ribs
## 42Yof to er for back pain after putting down some cat litter dx: back pain, sciatica